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Abstract 

Variational  inference  algorithms  provide  the  most  effective  framework  for  large- 
scale  training  of  Bayesian  nonparametric  models.  Stochastic  online  approaches 
are  promising,  but  are  sensitive  to  the  chosen  learning  rate  and  often  converge 
to  poor  local  optima.  We  present  a  new  algorithm,  memoized  online  variational 
inference,  which  scales  to  very  large  (yet  finite)  datasets  while  avoiding  the  com¬ 
plexities  of  stochastic  gradient.  Our  algorithm  maintains  finite-dimensional  suf¬ 
ficient  statistics  from  batches  of  the  full  dataset,  requiring  some  additional  mem¬ 
ory  but  still  scaling  to  millions  of  examples.  Exploiting  nested  families  of  varia¬ 
tional  bounds  for  infinite  nonparametric  models,  we  develop  principled  birth  and 
merge  moves  allowing  non-local  optimization.  Births  adaptively  add  components 
to  the  model  to  escape  local  optima,  while  merges  remove  redundancy  and  im¬ 
prove  speed.  Using  Dirichlet  process  mixture  models  for  image  clustering  and 
denoising,  we  demonstrate  major  improvements  in  robustness  and  accuracy. 


1  Introduction 

Bayesian  nonparametric  methods  provide  a  flexible  framework  for  unsupervised  modeling  of  struc¬ 
tured  data  like  text  documents,  time  series,  and  images.  They  are  especially  promising  for  large 
datasets,  as  their  nonparametric  priors  should  allow  complexity  to  grow  smoothly  as  more  data  is 
seen.  Unfortunately,  contemporary  inference  algorithms  do  not  live  up  to  this  promise,  scaling 
poorly  and  yielding  solutions  that  represent  poor  local  optima  of  the  true  posterior.  In  this  paper,  we 
propose  new  scalable  algorithms  capable  of  escaping  local  optima.  Our  focus  is  on  clustering  data 
via  the  Dirichlet  process  (DP)  mixture  model,  but  our  methods  are  much  more  widely  applicable. 

Stochastic  online  variational  inference  is  a  promising  general-purpose  approach  to  Bayesian  non¬ 
parametric  learning  from  streaming  data  [1].  While  individual  steps  of  stochastic  optimization 
algorithms  are  by  design  scalable,  they  are  extremely  vulnerable  to  local  optima  for  non-convex 
unsupervised  learning  problems,  frequently  yielding  poor  solutions  (see  Fig.  2).  While  taking  the 
best  of  multiple  runs  is  possible,  this  is  unreliable,  expensive,  and  ineffective  in  more  complex 
structured  models.  Furthermore,  the  noisy  gradient  step  size  (or  learning  rate)  requires  external  pa¬ 
rameters  which  must  be  fine-tuned  for  best  performance,  often  requiring  an  expensive  validation 
procedure.  Recent  work  has  proposed  methods  for  automatically  adapting  learning  rates  [2],  but 
these  algorithms’  progress  on  the  overall  variational  objective  remains  local  and  non-monotonic. 

In  this  paper,  we  present  an  alternative  algorithm,  memoized  online  variational  inference,  which 
avoids  noisy  gradient  steps  and  learning  rates  altogether.  Our  method  is  useful  when  all  data  may 
not  fit  in  memory,  but  we  can  afford  multiple  full  passes  through  the  data  by  processing  successive 
batches.  The  algorithm  visits  each  batch  in  turn  and  updates  a  cached  set  of  sufficient  statistics  which 
accurately  reflect  the  entire  dataset.  This  allows  rapid  and  noise-free  updates  to  global  parameters 
at  every  step,  quickly  propagating  information  and  speeding  convergence.  Our  memoized  approach 
is  generally  applicable  in  any  case  batch  or  stochastic  online  methods  are  useful,  including  topic 
models  [1]  and  relational  models  [3],  though  we  do  not  explore  these  here. 
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We  further  develop  a  principled  framework  for  escaping  local  optima  in  the  online  setting,  by  inte¬ 
grating  birth  and  merge  moves  within  our  algorithm’s  coordinate  ascent  steps.  Most  existing  mean- 
field  algorithms  impose  a  restrictive  fixed  truncation  in  the  number  of  components,  which  is  hard  to 
set  a  priori  on  big  datasets:  either  it  is  too  small  and  inexpressive,  or  too  large  and  computationally 
inefficient.  Our  birth  and  merge  moves,  together  with  a  nested  variational  approximation  to  the  pos¬ 
terior,  enable  adaptive  creation  and  pruning  of  clusters  on-the-fly.  Because  these  moves  are  validated 
by  an  exactly  tracked  global  variational  objective,  we  avoid  potential  instabilities  of  stochastic  on¬ 
line  split-merge  proposals  [4].  The  structure  of  our  moves  is  very  different  from  split-merge  MCMC 
methods  [5,  6];  applications  of  these  algorithms  have  been  limited  to  hundreds  of  data  points,  while 
our  experiments  show  scaling  of  memoized  split-merge  proposals  to  millions  of  examples. 

We  review  the  Dirichlet  process  mixture  model  and  variational  inference  in  Sec.  2,  outline  our  novel 
memoized  algorithm  in  Sec.  3,  and  evaluate  on  clustering  and  denoising  applications  in  Sec.  4. 

2  Variational  inference  for  Dirichlet  process  mixture  models 

The  Dirichlet  process  (DP)  provides  a  nonparametric  prior  for  partitioning  exchangeable  datasets 
into  discrete  clusters  [7].  An  instantiation  G  of  a  DP  is  an  infinite  collection  of  atoms,  each  of  which 
represents  one  mixture  component.  Component  k  has  mixture  weight  Wk  sampled  as  follows: 

oo  k—1 

G^DP{aoH),  G  =  ^Wk5tj>^,  Wfc  ~  Beta(l,  ao),  wu  =  Vk^il  -  vi).  (1) 

k=l  1=1 

This  stick-breaking  process  provides  mixture  weights  and  parameters.  Each  data  item  n  chooses 
an  assignment  Zn  ~  Cat(w),  and  then  draws  observations  Xn  ^  F{4>zn)-  The  data-generating 
parameter  (j)k  is  drawn  from  a  base  measure  H  with  natural  parameters  Aq.  We  assume  both  H  and 
F  belong  to  exponential  families  with  log-normalizers  a  and  sufficient  statistics  t: 

I  Ao)  =  exp  |Aofo(^fc)  -  ao(Ao)|,  p(a;„  |  ^fc)  =  exp  -  a((/)fe)|.  (2) 

For  simplicity,  we  assume  unit  reference  measures.  The  goal  of  inference  is  to  recover  stick-breaking 
proportions  Vk  and  data-generating  parameters  (j)k  for  each  global  mixture  component  k,  as  well  as 
discrete  cluster  assignments  z  =  {zn}n=i  for  s^ch  observation.  The  joint  distribution  is 

N  OO 

p{x,z,(f),v)  =  Y\_P{xn  I  (jiz jCat{zn  \  w{v))  Beta(r;fe  |  l,ao)H{(j)k  \  Xo)  (3) 

n—1  k—1 

While  our  algorithms  are  directly  applicable  to  any  DP  mixture  of  exponential  families,  our  experi¬ 
ments  focus  on  G-dimensional  real-valued  data  Xn,  for  which  we  take  F  to  be  Gaussian.  For  some 
data,  we  consider  full-mean,  full-covariance  analysis  (where  FI  is  normal- Wishart),  while  other  ap¬ 
plications  consider  zero-mean,  full-covariance  analysis  (where  FI  is  Wishart). 

2.1  Mean-field  variational  inference  for  DP  mixture  models 

To  approximate  the  full  (but  intractable)  posterior  over  variables  z,v,<j),  we  consider  a  fully- 
factorized  variational  distribution  q,  with  individual  factors  from  appropriate  exponential  families:^ 

N  K 

q{z,v,(j>)  =  q{zn\fn)  q{vk\ai,ao)q{(j)k\Xk),  (4) 

n—1  k—1 

q{Zn)  =  Cat{zn\fni,...rnK),  g(ufe)  =  Beta(r;fc  |  dfci ,  dfco),  qiM  =  H  {(l>k  \  Xk)  ■  (5) 

To  tractably  handle  the  infinite  set  of  components  available  under  the  DP  prior,  we  truncate  the 
discrete  assignment  factor  to  enforce  q{zn  =  k)  =  Q  for  k  >  K.  This  forces  all  data  to  be  explained 
by  only  the  first  K  components,  inducing  conditional  independence  between  observed  data  and  any 
global  parameters  Vk,  4>k  with  index  k  >  K.  Inference  may  thus  focus  exclusively  on  a  finite  set  of 
K  components,  while  reasonably  approximating  the  true  infinite  posterior  for  large  K. 

*To  ease  notation,  we  mark  variables  with  hats  to  distinguish  parameters  9  of  variational  factors  q  from 
parameters  9  of  the  generative  model  p.  In  this  way,  9k  and  9k  always  have  equal  dimension. 
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Crucially,  our  truncation  is  nested:  any  learned  q  with  truncation  K  can  be  represented  exactly  under 
truncation  K  +  lhy  setting  the  final  component  to  have  zero  mass.  This  truncation,  previously  advo¬ 
cated  by  [8,  4],  has  considerable  advantages  over  non-nested  direct  truncation  of  the  stick-breaking 
process  [7],  which  places  artifically  large  mass  on  the  final  component.  It  is  more  efficient  and 
broadly  applicable  than  an  alternative  trunction  which  sets  the  stick-breaking  “tail”  to  its  prior  [9]. 


Variational  algorithms  optimize  the  parameters  of  q  to  minimize  the  KL  divergence  from  the  true, 
intractable  posterior  [7].  The  optimal  q  maximizes  the  evidence  lower  bound  (ELBO)  objective  C: 


logp(x  I  ao,  Ao)  >  C{q)  =  E, 


\ogp{yi,v,z,,(j)  I  ao,  Ao)  -  log  q{v,  2,41) 


(6) 


For  DP  mixtures  of  exponential  family  distributions,  C{q)  has  a  simple  form.  For  each  component  k, 
we  store  its  expected  mass  Nk  and  expected  sufficient  statistic  Sfc(x).  All  but  one  term  in  C{q)  can 
then  be  written  using  only  these  summaries  and  expectations  of  the  global  parameters  v,  </>: 


Nk  =  E„ 


N  N  N  N 

=  '^Tnk,  5fc(x)  = 


K 


n—1 
N 


=  i:  Eg[</>fc]’^Sfc(x)  -  NkEq[a{(l)k)]  +  iVfcEq[log -  E  Tnk  log  fnk 


n—1 


+  En 


log 


Beta(?;fc  |  l,ao) 


q{vk  I  afci,afeo) 


+  Eq 


log 


Hi4>k  I  Ao) 
q{4>k  I  Afe)  J 


(7) 


(8) 


Excluding  the  entropy  term  —  ^  f^k  log  f„fe  which  we  discuss  later,  this  bound  is  a  simple  linear 
function  of  the  summaries  iVfc,  Sfc(x).  Given  precomputed  entropies  and  summaries,  evaluation  of 
C{q)  can  be  done  in  time  independent  of  the  data  size  N .  We  next  review  variational  algorithms 
for  optimizing  q  via  coordinate  ascent,  iteratively  updating  individual  factors  of  q.  We  describe 
algorithms  in  terms  of  two  updates  [1]:  global  parameters  (stick-breaking  proportions  Vk  and  data- 
generating  parameters  (j>k),  and  local  parameters  (assignments  of  data  to  components  Zn)- 


2.2  Full-dataset  variational  inference 

Standard  full-dataset  variational  inference  [7]  updates  local  factors  q{zn  \  f„)  for  all  observations 
n  =  1, . . . ,  by  visiting  each  item  n  and  computing  the  fraction  fnk  explained  by  component  k: 

fnk  =  exp  (E5[logu;fc(u)]  +  E5[logp(a;„  |  ^fc)] ),  fnk  =  — •  (9) 

^  ^  Ee=i  rni 

Next,  we  update  global  factors  g(r’fc|dfci,  a^o),  (z(</>fc|Afc)  for  each  component  k.  After  computing 
summary  statistics  Nk,  Sk^x.)  given  the  new  fnk  via  Eq.  (7),  the  update  equations  become 

K 

aki=  cti-f  Nk,  dfco  =  ao  +  ^  Ni,  Afc  =  Aq -f  Sfc (x) .  (10) 

l=k+l 

While  simple  and  guaranteed  to  converge,  this  approach  scales  poorly  to  big  datasets.  Because 
global  parameters  are  updated  only  after  a  full  pass  through  the  data,  information  propagates  slowly. 

2.3  Stochastic  online  variational  inference 

Stochastic  online  (SO)  variational  inference  scales  to  huge  datasets  [1].  Instead  of  analyzing  all  data 
at  once,  SO  processes  only  a  subset  (“batch”)  Bt  at  each  iteration  t.  These  subsets  are  assumed 
sampled  uniformly  at  random  from  a  larger  (but  fixed  size  N)  corpus.  Given  a  batch,  SO  first 
updates  local  factors  q{zn)  for  n  &  Bt  via  Eq.  (9).  It  then  updates  global  factors  via  a  noisy  gradient 
step,  using  sufficient  statistics  of  q{zn)  from  only  the  current  batch.  These  steps  optimize  a  noisy 
function,  which  in  expectation  (with  respect  to  batch  sampling)  converges  to  the  true  objective  (6). 

Natural  gradient  steps  are  computationally  tractable  for  exponential  family  models,  involving  nearly 
the  same  computations  as  the  full-dataset  updates  [1].  For  example,  to  update  the  variational  param¬ 
eter  Afe  from  (5)  at  iteration  t,  we  first  compute  the  global  update  given  only  data  in  the  current  batch. 
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amplified  to  be  at  full-dataset  scale:  =  Aq  -I-  Then,  we  interpolate  between  this  and 

the  previous  global  parameters  to  arrive  at  the  final  result:  A^*^  ^  Pt^k  +  (1  ~  Pt)^t  learn¬ 

ing  rate  pt  controls  how  “forgetful”  the  algorithm  is  of  previous  values;  if  it  decays  at  appropriate 
rates,  stochastic  inference  provahly  converges  to  a  local  optimum  of  the  global  objective  C{q)  [1]. 

This  online  approach  has  clear  computational  advantages  and  can  sometimes  yield  higher  quality 
solutions  than  the  full-data  algorithm,  since  it  conveys  information  between  local  and  global  pa¬ 
rameters  more  frequently.  However,  performance  is  extremely  sensitive  to  the  learning  rate  decay 
schedule  and  choice  of  batch  size,  as  we  demonstrate  in  later  experiments. 


3  Memoized  online  variational  inference 


Generalizing  previous  incremental  variants  of  the  expectation  maximization  (EM)  algorithm  [10],  we 
now  develop  our  memoized  online  variational  inference  algorithm.  We  divide  the  data  into  B  fixed 
batches  For  each  batch,  we  maintain  memo/zet/ sufficient  statistics  =  [Nk{Bb),  Sk{Bb)] 

for  each  component  fc.  We  also  track  the  full-dataset  statistics  S'®  =  [7Vfe,Sfc(x)].  These  compact 
summary  statistics  allow  guarantees  of  correct  full-dataset  analysis  while  processing  only  one  small 
batch  at  a  time.  Our  approach  hinges  on  the  fact  that  these  sufficient  statistics  are  additive:  sum¬ 
maries  of  an  entire  dataset  can  be  written  exactly  as  the  addition  of  summaries  of  distinct  batches. 
Note  that  our  memoization  of  deterministic  analyses  of  batches  of  data  is  distinct  from  the  stochastic 
memoization,  or  “lazy”  instantiation,  of  random  variables  in  some  Monte  Carlo  methods  [11,  12]. 

Memoized  inference  proceeds  by  visiting  (in  random  order)  each  distinct  batch  once  in  a  full  pass 
through  the  data,  incrementally  updating  the  local  and  global  parameters  related  to  that  batch  h. 
First,  we  update  local  parameters  for  the  current  batch  iq{zn  \  fn)  for  n  G  Bt)  via  Eq.  (9).  Next,  we 
update  cached  global  sufficient  statistics  for  each  component:  we  subtract  the  old  (cached)  summary 
of  batch  6,  compute  a  new  batch-level  summary,  and  add  the  result  to  the  full-dataset  summary: 


Bk,  Sk  ■(—  1^  ^  fnk,  ^  fnkt{Xn)  , 
n^Bh  nGBb 


S°k^S°k  +  si 


(11) 


Finally,  given  the  new  full-dataset  summary  S'®,  we  update  global  parameters  exactly  as  in  Eq.  (10). 
Unlike  stochastic  online  algorithms,  memoized  inference  is  guaranteed  to  improve  the  full-dataset 
EFBO  at  every  step.  Correctness  follows  immediately  from  the  arguments  in  [10].  By  construction, 
each  local  or  global  step  will  result  in  a  new  q  that  strictly  increases  the  objective  C{q). 

In  the  limit  where  B  =  1,  memoized  inference  reduces  to  standard  full-dataset  updates.  However, 
given  many  batches  it  is  far  more  scalable,  while  maintaining  all  guarantees  of  batch  inference.  Fur¬ 
thermore,  it  generally  converges  faster  than  the  full-dataset  algorithm  due  to  frequently  interleaving 
global  and  local  updates.  Provided  we  can  store  memoized  sufficient  statistics  for  each  batch  (not 
each  observation),  memoized  inference  has  the  same  computational  complexity  as  stochastic  meth¬ 
ods  while  avoiding  noise  and  sensitivity  to  learning  rates.  Recent  analysis  of  convex  optimization 
algorithms  [13]  demonstrated  theoretical  and  practical  advantages  for  methods  that  use  cached  full- 
dataset  summaries  to  update  parameters,  as  we  do,  instead  of  stochastic  current-batch-only  updates. 

This  memoized  algorithm  can  compute  the  full-dataset  objective  C{q)  exactly  at  any  point  (after 
visiting  all  items  once).  To  do  so  efficiently,  we  need  to  compute  and  store  the  assignment  entropy 
Hk  =  -  J2n€Bb  ’""fc  log  Tnk  after  visiting  each  batch  b.  We  also  need  to  track  the  full-data  entropy 

7/®  =  Btl  which  is  additive  just  like  the  sufficient  statistics  and  incrementally  updated  after 

each  batch.  Given  both  and  S'®,  evaluation  of  the  full-dataset  EFBO  in  Eq.  (8)  is  exact  and  rapid. 


3.1  Birth  moves  to  escape  local  optima 

We  now  propose  additional  birth  moves  that,  when  interleaved  with  conventional  coordinate  ascent 
parameter  updates,  can  add  useful  new  components  to  the  model  and  escape  local  optima.  Previous 
methods  [14,  9,  4]  for  changing  variational  truncations  create  just  one  extra  component  via  a  “split” 
move  that  is  highly-specialized  to  particular  likelihoods.  Wang  and  Blei  [15]  explore  truncation 
levels  via  a  local  collapsed  Gibbs  sampler,  but  samplers  are  slow  to  make  large  changes.  In  contrast, 
our  births  add  many  components  at  once  and  apply  to  any  exponential  family  mixture  model. 
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Figure  1 :  One  pass  through  a  toy  dataset  for  memoized  learning  with  birth  and  merge  moves  (MO-BM), 
showing  creation  (left)  and  adoption  (right)  of  new  components.  Left:  Scatter  plot  of  2D  observed  data,  and  a 
subsample  targeted  via  the  first  mixture  component.  Elliptical  contours  show  component  covariance  matrices. 
Right:  Bar  plots  of  memoized  counts  for  each  batch.  Not  shown:  Memoized  sufficient  statistics 


Creating  new  components  in  the  online  setting  is  challenging.  Each  small  batch  may  not  have 
enough  examples  of  a  missing  component  to  inspire  a  good  proposal,  even  if  that  component  is  well- 
supported  by  the  full  dataset.  We  thus  advocate  birth  moves  that  happen  in  three  phases  (collection, 
creation,  and  adoption)  over  two  passes  of  the  data.  The  first  pass  collects  a  targeted  data  sample 
more  likely  to  yield  informative  proposals  than  a  small,  predefined  batch.  The  second  pass,  shown  in 
Fig.  1,  creates  new  components  and  then  updates  every  batch  with  the  expanded  model.  Successive 
births  are  interleaved;  at  each  pass  there  are  proposals  both  active  and  in  preparation.  We  sketch  out 
each  step  of  the  algorithm  below.  For  complete  details,  see  the  supplement. 

Collection  During  pass  1 ,  we  collect  a  targeted  subsample  x'  of  the  data,  of  size  at  most  N'  =  10^. 
This  subsample  targets  a  single  component  k' .  When  visiting  each  batch,  we  copy  data  a;„  into  x'  if 
fnk'  >  T  (we  set  T  =  0.1).  This  threshold  test  ensures  the  subsample  contains  related  data,  but  also 
promotes  diversity  by  considering  data  explained  partially  by  other  components  k  k' .  Targeted 

samples  vary  from  iteration  to  iteration  because  batches  are  visited  in  distinct,  random  orders. 
Creation  Before  pass  2,  we  create  new  components  by  fitting  a  DP  mixture  model  with  K'  (we 
take  K'  —  10)  components  to  x',  running  variational  inference  for  a  limited  budget  of  iterations. 
Taking  advantage  of  our  nested  truncation,  we  expand  our  current  model  to  include  all  K  +  K'  com¬ 
ponents,  as  shown  in  Fig.  1.  Unlike  previous  work  [9, 4],  we  do  not  immediately  assess  the  change  in 
EFBO  produced  by  these  new  components,  and  always  accept  them.  We  rely  on  subsequent  merge 
moves  (Sec.  3.2)  to  remove  unneeded  components. 

Adoption  During  pass  2,  we  visit  each  batch  and  perform  local  and  global  parameter  updates  for 
the  expanded  {K  +  AT') -component  mixture.  These  updates  use  expanded  global  summaries  5° 
that  include  summaries  S"  from  the  targeted  analysis  of  x'.  This  results  in  two  interpretations  of  the 
subset  x':  assignment  to  original  components  (mostly  k')  and  assignment  to  brand-new  components. 
If  the  new  components  are  favored,  they  will  gain  mass  via  new  assignments  made  at  each  batch. 
After  the  pass,  we  subtract  away  S'  to  yield  both  S'®  and  global  parameters  exactly  consistent  with 
the  data  x.  Any  nearly-empty  new  component  will  likely  be  pruned  away  by  later  merges. 

By  adding  many  components  at  once,  our  birth  move  allows  rapid  escape  from  poor  local  optima. 
Alone,  births  may  sometimes  cause  a  slight  EFBO  decrease  by  adding  unnecessary  components. 
However,  in  practice  merge  moves  reliably  reject  poor  births  and  restore  original  configurations.  In 
Sec.  4,  births  are  so  effective  that  runs  started  at  AT  =  1  recover  necessary  components  on-the-fly. 

3.2  Merge  moves  that  optimize  the  full  data  objective 

The  computational  cost  of  inference  grows  with  the  number  of  components  AT.  To  keep  AT  small,  we 
develop  merge  moves  that  replace  two  components  with  a  single  merged  one.  Merge  moves  were 
first  explored  for  batch  variational  methods  [16,  14].  For  hierarchical  DP  topic  models,  stochastic 
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Figure  2:  Comparison  of  full-data,  stochastic  (SO),  and  memoized  (MO)  on  toy  data  with  K  =  8  true 
components  (Sec.  4.1).  Top:  Trace  of  ELBO  during  training  across  10  runs.  SO  compared  with  learning  rates 
a,b,c.  Bottom  Left:  Example  patch  generated  by  each  component.  Bottom:  Covariance  matrix  and  weights  Wk 
found  by  one  run  of  each  method,  aligned  to  true  components.  “X”:  no  comparable  component  found. 


variational  inference  methods  have  been  augmented  to  evaluate  merge  proposals  based  on  noisy, 
single-batch  estimates  of  the  ELBO  [4].  This  can  result  in  accepted  merges  that  decrease  the  full- 
data  objective  (see  Sec.  4.1  for  an  empirical  illustration).  In  contrast,  our  algorithm  accurately 
computes  the  full  ELBO  for  each  merge  proposal,  ensuring  only  useful  merges  are  accepted. 

Given  two  components  fca,  kb  to  merge,  we  form  a  candidate  q'  with  K  —  1  components,  where 
merged  component  km  takes  over  all  assignments  to  ka,kb\  fnkm  =  ^nka,  +  ^nkb-  Instead  of  com¬ 
puting  new  assignments  explicitly,  additivity  allows  direct  construction  of  merged  global  sufficient 
statistics:  =  5°^  -I-  Merged  global  parameters  follow  from  Eq.  (10). 

Our  merge  move  has  three  steps:  select  components,  form  the  candidate  configuration  q',  and  accept 
q'  if  the  ELBO  improves.  Selecting  ka,  kb  to  merge  at  random  is  unlikely  to  yield  an  improved 
configuration.  After  choosing  ka  at  random,  we  select  kb  using  a  ratio  of  marginal  likelihoods  M 
which  compares  the  merged  and  separated  configurations,  easily  computed  with  cached  summaries: 

pikb  I  ka)  oc  M{Sk)  =  exp(aoiXo  +  Sk{x))y  (12) 

Our  memoized  approach  allows  exact  evaluation  of  the  full-data  ELBO  to  compare  the  existing  q  to 
merge  candidate  q'.  As  shown  in  Eq.  (8),  evaluating  C{q')  is  a  linear  function  of  merged  sufficient 

statistics,  except  for  the  assignment  entropy  term:  Hab  =  -  J2n=ii^rika.  +  fnkj  ^og{fnka.  +  rnk^)- 
We  compute  this  term  in  advance  for  all  possible  merge  pairs.  This  requires  storing  one  set  of 
K[K  —  l)/2  scalars,  one  per  candidate  pair,  for  each  batch.  This  modest  precomputation  allows 
rapid  and  exact  merge  moves,  which  improve  model  quality  and  speed-up  post-merge  iterations. 

In  one  pass  of  the  data,  our  algorithm  performs  a  birth,  memoized  ascent  steps  for  all  batches,  and 
several  merges  after  the  final  batch.  After  a  few  passes,  it  recovers  high-quality,  compact  structure. 

4  Experimental  results 

We  now  compare  algorithms  for  learning  DP-Gaussian  mixture  models  (DP-GMM),  using  our  own 
implementations  of  full-dataset,  stochastic  online  (SO),  and  memoized  online  (MO)  inference,  as 
well  as  our  new  birth-merge  memoized  algorithm  (MO-BM).  Code  is  available  online.  To  examine 
SO’s  sensitivity  to  learning  rate,  we  use  a  recommended  [1]  decay  schedule  pt  =  {t  +  with 
three  diverse  settings:  a)  k  =  0.5,  d  =  10,  b)  k  =  0.5,  d  =  100,  and  c)«;  =  0.9,  d  =  10. 

4.1  Toy  data:  How  reliably  do  algorithms  escape  local  optima? 

We  first  study  N  =  100000  synthetic  image  patches  generated  by  a  zero-mean  GMM  with  8  equally- 
common  components.  Each  one  is  defined  by  a  25  x  25  covariance  matrix  producing  5x5  patches 
with  a  strong  edge.  We  investigate  whether  algorithms  recover  the  true  K  —  8  structure.  Each  fixed- 
truncation  method  runs  from  10  fixed  random  initializations  with  K  =  25,  while  MO-BM  starts  at 
K  =  \.  Online  methods  traverse  100  batches  (1000  examples  per  batch). 
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Figure  3:  MNIST.  Top:  Comparison  of  final  ELBO  for  multiple  runs  of  each  method,  varying  initialization 
and  number  of  batches.  Stochastic  online  (SO)  compared  at  learning  rates  a,b,c.  Bottom  left:  Visualization  of 
cluster  means  for  MO-BM’s  best  run.  Bottom  center:  Evaluation  of  cluster  alignment  to  true  digit  label.  Bottom 
right:  Growth  in  truncation-level  K  as  more  data  visited  with  MO-BM. 


Fig.  2  traces  the  training-set  ELBO  as  more  data  arrives  for  each  algorithm  and  shows  estimated 
covariance  matrices  for  the  top  8  components  for  select  runs.  Even  the  best  runs  of  SO  do  not 
recover  ideal  structure.  In  contrast,  all  10  runs  of  our  birth-merge  algorithm  find  all  8  components, 
despite  initialization  at  iV  =  1.  The  ELBO  trace  plots  show  this  method  escaping  local  optima,  with 
slight  drops  indicating  addition  of  new  components  followed  by  rapid  increases  as  these  are  adopted. 
They  further  suggest  that  our  fixed-truncation  memoized  method  competes  favorably  with  full-data 
inference,  often  converging  to  similar  or  better  solutions  after  fewer  passes  through  the  data. 

The  fact  that  our  MO-BM  algorithm  only  performs  merges  that  improve  the  full-data  ELBO  is  cru¬ 
cial.  Eig.  2  shows  trace  plots  of  GreedyMerge,  a  memoized  online  variant  that  instead  uses  only  the 
current-batch  ELBO  to  assess  a  proposed  merge,  as  done  in  [4].  Given  small  batches  (1000  exam¬ 
ples  each),  there  is  not  always  enough  data  to  warrant  many  distinct  25  x  25  covariance  components. 
Thus,  this  method  favors  merges  that  in  fact  remove  vital  structure.  All  5  runs  of  this  GreedyMerge 
algorithm  ruinously  accept  merges  that  decrease  the  full  objective,  consistently  collapsing  down  to 
just  one  component.  Our  memoized  approach  ensures  merges  are  always  globally  beneficial. 

4.2  MNIST  digit  clustering 

We  now  compare  algorithms  for  clustering  N  =  60000  MNIST  images  of  handwritten  digits  0-9. 
We  preprocess  as  in  [9],  projecting  each  image  down  to  D  =  50  dimensions  via  PCA.  Here,  we  also 
compare  to  Kurihara’s  public  implementation  of  variational  inference  with  split  moves  [9].  MO-BM 
and  Kurihara  start  at  AT  =  1,  while  other  methods  are  given  10  runs  from  two  AT  =  100  initialization 
routines:  random  and  smart  (based  on  k-means-H-  [17]).  Eor  online  methods,  we  compare  20  and 
100  batches,  and  three  learning  rates.  All  runs  complete  200  passes  through  the  full  dataset. 

The  final  ELBO  values  for  every  run  of  each  method  are  shown  in  Eig.  3.  SO’s  performance  varies 
dramatically  across  initialization,  learning  rate,  and  number  of  batches.  Under  random  initializa¬ 
tion,  SO  reaches  especially  poor  local  optima  (note  lower  y-axis  scale).  In  contrast,  our  memoized 
approach  consistently  delivers  solutions  on  par  with  full  inference,  with  no  apparent  sensitivity  to 
the  number  of  batches.  With  births  and  merges  enabled,  MO-BM  expands  from  AT  =  1  to  over  80 
components,  finding  better  solutions  than  every  smart  K  =  100  initialization.  MO-BM  even  outper¬ 
forms  Kurihara’s  offline  split  algorithm,  yielding  30-40  more  components  and  higher  ELBO  values. 
Altogether,  Eig.  3  exposes  SO’s  extreme  sensitivity,  validates  MO  as  a  more  reliable  alternative,  and 
shows  that  our  birth-merge  algorithm  is  more  effective  at  avoiding  local  optima. 

Eig.  3  also  shows  cluster  means  learned  by  the  best  MO-BM  run,  covering  many  styles  of  each  digit. 
We  further  compute  a  hard  segmentation  of  the  data  using  the  q{z)  from  smart  initialization  runs. 
Each  DP-GMM  cluster  is  aligned  to  one  digit  by  majority  vote  of  its  members.  A  plot  of  alignment 
accuracy  in  Pig.  3  shows  our  MO-BM  consistently  among  the  best,  with  SO  lagging  significantly. 
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Figure  4:  SUN-397  tiny  images.  Left:  ELBO  during  training.  Right:  Visualization  of  10  of  28  learned  clusters 
for  best  MO-BM  run.  Each  column  shows  two  images  from  the  top  3  categories  aligned  to  one  cluster. 


Figure  5:8x8  image  patches.  Left:  ELBO  during  training,  N  —  1.88  million.  Center:  Effective  truncation- 
level  K  during  training,  N  =  1.88  million.  Right:  ELBO  during  training,  N  =  8.64  million. 


4.3  Tiny  image  clustering 

We  next  learn  a  full-mean  DP-GMM  for  tiny,  32  x  32  images  from  the  SUN-397  scene  categories 
dataset  [18].  We  preprocess  all  108754  color  images  via  PCA,  projecting  each  example  down  to 
D  =  50  dimensions.  We  start  MO-BM  at  iL  =  1,  while  other  methods  have  fixed  K  =  100.  Fig.  4 
plots  the  training  ELBO  as  more  data  is  seen.  Our  MO-BM  runs  surpass  all  other  algorithms. 

To  verify  quality,  Fig.  4  shows  images  from  the  3  most-related  scene  categories  for  each  of  several 
clusters  found  by  MO-BM.  For  each  learned  cluster  k,  we  rank  all  397  categories  to  find  those  with 
the  largest  fraction  of  members  assigned  to  k  via  f.^.  The  result  is  quite  sensible,  with  clusters  for 
tall  free-standing  objects,  swimming  pools  and  lakes,  doorways,  and  waterfalls. 


4.4  Image  patch  modeling 

Our  last  experiment  applies  a  zero-mean,  full-covariance  DP-GMM  to  learn  the  covariance  struc¬ 
tures  of  natural  image  patches,  inspired  by  [19,  20].  We  compare  online  algorithms  on  N  =  1.88 
million  8x8  patches,  a  dense  subsampling  of  all  patches  from  200  images  of  the  Berkeley  Seg¬ 
mentation  dataset.  Fig.  5  shows  that  our  birth-merge  memoized  algorithm  started  at  K  —  1  can 
consistently  add  useful  components  and  reach  better  solutions  than  alternatives.  We  also  examined 
a  much  bigger  dataset  of  =  8.64  million  patches,  and  still  see  advantages  for  our  MO-BM. 

Finally,  we  perform  denoising  on  30  heldout  images,  using  code  from  [19].  Our  best  MO-BM  run  on 
the  1.88  million  patch  dataset  achieves  PSNR  of  28.537  dB,  within  0.05  dB  of  the  PSNR  achieved 
by  [19]’s  publicly-released  GMM  with  K  =  200  trained  on  a  similar  corpus.  This  performance  is 
visually  indistinguishable,  highlighting  the  practical  value  of  our  new  algorithm. 


5  Conclusions 


Our  novel  memoized  online  variational  algorithm  avoids  noisiness  and  sensitivity  inherent  in 
stochastic  methods.  Our  birth  and  merge  moves  successfully  escape  local  optima.  These  innovations 
are  applicable  to  common  nonparametric  models  beyond  the  Dirichlet  process. 
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